DOCUMENT RESUME

ED 218 300                                        TM 820 327

AUTHOR        Holland, Paul W.; Rubin, Donald B.
TITLE         Causal Inference in Prospective and Retrospective Studies.
SPONS AGENCY  Educational Testing Service, Princeton, N.J.; National Inst. of Education (ED), Washington, DC.
PUB DATE      Aug 80
GRANT         NIE-G-78-0157
NOTE          47p.
EDRS PRICE    MF01/PC02 Plus Postage.
DESCRIPTORS   *Mathematical Models; Research Design; Research Methodology; *Statistical Analysis; *Statistical Studies
IDENTIFIERS   *Causal Inferences

ABSTRACT

Emphasizing the measurement of causal effects to arrive at a better understanding of the causal mechanisms involved in statistical theory, a mathematical model for causal inferences in prospective studies is developed and then applied to retrospective case-control studies. Before developing the model, causal agents are delineated, and causal effects are distinguished from "gains over time". The formal model is presented considering indirect measurement of causal effects, homogeneous populations, intermediate-level causal effects, the selection variable, randomization and the role of covariates. In the retrospective case-control studies, retrospective and prospective probabilities and matching are discussed. A loglinear model for a case-control study problem is presented. (Author/CM)

Reproductions supplied by EDRS are the best that can be made from the original document.


Causal Inference in Prospective and Retrospective Studies



U.S. DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION
EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)

This document has been reproduced as received from the person or organization originating it.
Minor changes have been made to improve reproduction quality.
Points of view or opinions stated in this document do not necessarily represent official NIE position or policy.



by

Paul W. Holland
and
Donald B. Rubin

August, 1980



This work was partially supported by the Program Statistics Research Project at Educational Testing Service. Paul Holland was partially supported by grant NIE-G-78-0157 from the National Institute of Education. Donald Rubin's work was facilitated by a John Simon Guggenheim Fellowship.








Preface

It is an honor to present this discussion paper at the Jerome Cornfield Memorial Session of The American Statistical Association. The topic of our paper seems especially appropriate for this session since many important contributions to the study of health effects from prospective and retrospective studies were made by Jerome Cornfield, e.g., Cornfield (1951, 1956).



1. Introduction



Philosophical discussions of causality can be far ranging and touch upon an enormous variety of subjects. The reason is the emphasis, in the philosophy of science, on the understanding of causal mechanisms. Statistical discussions of causality are substantially more limited in scope because the contributions of statistics are to the measurement of the size of causal effects and not to the understanding of causal mechanisms. This distinction is sometimes expressed as "statistics can establish correlation but not causation." We feel our emphasis on measurement versus understanding is more appropriate because it focuses on the things that statistical theory can contribute to discussions of causality rather than on what it cannot. It is perfectly possible to measure a causal effect accurately without any understanding, whatsoever, of the causal mechanisms involved. The measurement of causal effects without understanding the causal mechanism is, of course, a commonplace experience of everyday life, i.e., people are quite capable of using automobiles, ovens, calculators and typewriters safely and effectively without detailed or, in some cases, any knowledge of how these devices work. On the other hand, precise measurement of causal effects often leads to a better understanding of the causal mechanisms involved.

In this paper we develop a mathematical model for causal inferences in prospective studies that is based on the work of Rubin (1978), and we then apply it to causal inference in retrospective case-control studies. Before developing this model, we shall briefly delineate what we consider to be properly called causal agents in a statistical discussion, and shall also sharply distinguish causal effects from "gains over time", two ideas that our experience suggests are often confused with each other.



Causal Agents



Consider the following two statements:

(a) "That person didn't do well on the exam because she did not study first."

(b) "That person didn't do well on the exam because she is a woman."

In statement (a), the implied causal agent is the amount of studying; that is, had the person studied harder, she could have done better on the test. In other words, there was a point in time when a choice was made either to study or not to study, and the comparison between the subsequent scores on the exam is the causal effect of studying versus not studying.

Statement (b) is statistically very different from statement (a) in that there is no choice of levels of a causal agent possible, i.e., the person cannot choose, or be assigned, to be a male or female. Consequently, there is no logical comparison possible for a single individual between their score on the test as a female and their score on the test as a male. The use of cause in statement (b) reflects only the correlation between attributes of individuals. In medical studies the term "risk factor" is sometimes used broadly to encompass both causal agents like smoking, which can be altered, and individual attributes like age and sex, which cannot. The identification of gender or other individual attributes such as race as a causal agent in such questions is statistically meaningless. It may also distract scientists from the study of causal agents that can have beneficial effects, e.g., finding programs of study that are particularly effective for women and others that are particularly effective for men.

It is common usage to say that the levels of a causal agent are treatments, especially when their assignment is under an experimenter's control.

Our definition of causal agent is much stricter than some definitions commonly used by economists, e.g., Granger causality (Granger, 1969). We believe that to label any successful predictor a causal agent not only misuses the language and thus is deceptive, but also may lead researchers away from study of the relevant scientific questions of the effect of manipulations that are possible.



Gains versus causal effects



In order to distinguish between gains and causal effects, consider a student who was coached for the Scholastic Aptitude Test (SAT) between the first administration of the SAT and the second. Let the two scores be $\mathrm{SAT}_1$ and $\mathrm{SAT}_{2c}$, the subscripts indicating the coaching that took place between administrations of the SAT. The causal effect of coaching is not $\mathrm{SAT}_{2c} - \mathrm{SAT}_1$. This difference is the gain (or loss) over time in SAT scores. The causal effect of coaching is the difference between the SAT score at administration 2 given exposure to coaching, $\mathrm{SAT}_{2c}$, and the SAT score at administration 2 given no exposure to coaching, say $\mathrm{SAT}_{2n}$. The pre-coaching SAT score, $\mathrm{SAT}_1$, may be useful in estimating the causal effect of coaching, but $\mathrm{SAT}_{2c} - \mathrm{SAT}_1$ is not equal to the causal effect of coaching unless we assume that $\mathrm{SAT}_1 = \mathrm{SAT}_{2n}$, i.e., a "no change without coaching" assumption. The tenability of such "no change" assumptions in general depends a great deal on the particular substantive problem under study. It is probably false in this SAT example.
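The distinction between a gain and a causal effect can be made concrete with a small numerical sketch. The three scores below are invented purely for illustration (they are not data from any study), and the variable names are hypothetical; the point is only that the two differences coincide exactly when the "no change without coaching" assumption holds.

```python
# Hypothetical scores for one student; the numbers are illustrative only.
sat_1 = 500    # score at administration 1 (before coaching)
sat_2c = 540   # score at administration 2 if coached
sat_2n = 520   # score at administration 2 if NOT coached
               # (never observable for a student who was in fact coached)

gain = sat_2c - sat_1             # change over time
causal_effect = sat_2c - sat_2n   # effect of coaching relative to no coaching

print(gain)            # 40
print(causal_effect)   # 20

# The gain equals the causal effect only under "no change without coaching".
print(gain == causal_effect, sat_1 == sat_2n)   # False False
```

In practice only `sat_1` and `sat_2c` would be recorded for a coached student, so `causal_effect` cannot be computed directly for that student.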



2. Causal inference in prospective studies



The logic of measuring the size of causal effects is clearest in prospective studies and so we shall begin with that case. The essential elements of a prospective study are the following:

(a) a population of units, Q;

(b) a set of well-defined levels of a causal agent (or treatments) to which each unit in Q could be exposed (for notational simplicity, we consider only two treatments denoted by $\ell = 1$ or $\ell = 2$);

(c) a response Y which can be recorded for each unit after exposure to a treatment.

In a prospective study, a sample of units from Q is obtained and the units are assigned to treatments. The treatments are then applied, and later the response of each unit in the study is recorded. The intuitive notion of causal effect that we wish to describe with our model is the difference between the response measured on a unit that is exposed to treatment 1 and the response that would have been measured on the same unit had it been exposed to treatment 2. Thus, our notion of the causal effect of a treatment will always be relative to another treatment, and is defined for each unit in Q.

This meaning of causal effect is not foreign to statistical thinking and is evident in the writings of R.A. Fisher (1925), Kempthorne (1952), Cochran (1965) and Cox (1958), for example. Although this notion of a causal effect can be defined for each unit in Q, in general, we are not able to directly measure a causal effect for a single unit because having given treatment 1 we cannot return in time to give treatment 2 instead. This is the fundamental problem of causal inference, and our formal model will show how its solution is related to the use of randomization and of covariates.

Before turning to the formal model we need to define the nature of the response Y. For our discussion we will assume that Y is dichotomous, taking on only the values $k = 0$ or $k = 1$. The extension to Y taking values in an arbitrary finite set is straightforward. We have chosen to restrict Y to be discrete in order to emphasize the fundamental ideas behind the measurement of causal effects without being distracted by the special mathematical baggage that is automatically associated with continuous variables, i.e., additivity, etc.

The formal model and definition of unit-level causal effect

In our model, instead of a single dependent variable Y, we have a dependent variable, $Y_\ell$, for each of the treatments to which the unit could have been exposed. Thus, if the unit is exposed to treatment 1, then we will record the value of $Y_1$ for that unit. If that same unit had been exposed to treatment 2 instead of treatment 1, then we will record the value of $Y_2$ for that unit and not the value of $Y_1$. More formally, for two treatments, with each unit in Q we associate the following partially observable vector of information:

$$(Y_1, Y_2) \tag{1}$$

where

$Y_\ell$ = response made by the unit if it is exposed to treatment $\ell$.

The novel feature of this model is the introduction of several versions of the response variable, Y. There is a version of Y for each level of the causal agent because our definition of causal effect compares $Y_1$ (the response made if exposed to level 1) to $Y_2$ (the response made if exposed to level 2). The fact that each unit has a value for both $Y_1$ and $Y_2$ is very important because it allows us to define causal effects at the level of individual units. If $Y_1 = 1$ and $Y_2 = 0$ for a particular unit, then the causal effect of treatment 1 relative to 2 for that unit is to change the response for that unit from 0 to 1. Rubin (1980) refers to the assumption that the vector (1) fully represents the possible values of Y under all treatment assignments as the "stable unit-treatment value" assumption.

A question that immediately arises is whether or not it is ever possible to expose a unit to more than one treatment and thereby directly observe more than one component of the vector in (1). One can argue that this is never possible in principle, because once a unit has been exposed to a treatment, the unit is different from what it was before. However, the reasonableness of this extreme position depends on the nature of the treatments and on the units under study. We will not pursue this issue further here but will simply make the "worst case" assumption that a unit can be exposed to at most one treatment condition. For our application to retrospective studies this assumption is adequate since in these studies units are only exposed to one level of treatment.

In going from the partially observable vector in (1) to the observable data we must introduce the variable S where $S = \ell$ if the unit is exposed to treatment $\ell$; S is the "selection" or "assignment" variable that indicates to which treatment the individual is exposed.

The observable data from a unit in Q is the vector

$$(Y_S, S).$$

The notation $Y_S$ is used because it indicates that we can observe only the response of a unit to the treatment to which it is exposed, i.e.,

$$Y_S = Y_1 \text{ if } S = 1, \qquad Y_S = Y_2 \text{ if } S = 2.$$

The quantity $Y_S$ is the observed value of the response and is therefore what is usually called the "dependent variable" in statistical discussions. We never get to observe $Y_\ell$ if $S \neq \ell$. Since we can observe only the value of $Y_1$ or $Y_2$ but not both, it is a consequence of the model that causal effects for individual units are not directly measurable. Indirect measurement of the causal effects is sometimes possible, and our purpose here is to analyze this possibility for both prospective and retrospective studies.

In summary, our idealized model for a prospective causal study can be viewed as based on the following sequence of steps.

(A) Determination of the population Q under study.

(B) Determination of the treatments under study.

(C) Determination of the response variable Y to be observed.

(D) Consequent definition of the vector $(Y_1, Y_2)$ for every unit in Q.

(E) Determination of the assignment variable S for every unit in the study.

(F) Consequent definition of the vector $(Y_S, S)$ for every unit in the study.

(G) Observation of $(Y_S, S)$ for each unit in the study.
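Steps (A) through (G) can be sketched as a short simulation. Everything here is an assumption made for illustration: the population size, the way the values of $(Y_1, Y_2)$ are generated, and the use of a coin flip for S. The sketch only shows how the observable pair $(Y_S, S)$ is derived from the partially observable vector $(Y_1, Y_2)$.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# (A)-(D): a hypothetical population Q in which every unit carries (Y1, Y2).
Q = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(10)]

# (E): the assignment variable S for every unit (here a fair coin flip).
S = [rng.choice([1, 2]) for _ in Q]

# (F)-(G): the observable data (Y_S, S); the other component stays hidden.
observed = [((y1 if s == 1 else y2), s) for (y1, y2), s in zip(Q, S)]

for (y1, y2), (y_s, s) in zip(Q, observed):
    print(f"(Y1, Y2) = ({y1}, {y2})   S = {s}   observed Y_S = {y_s}")
```

In a real study only the `observed` list would be available; the full list `Q` exists here only because the data are simulated.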

Indirect measurement of causal effects

Although our definition of causal effect at the unit level corresponds to most everyday uses of the term "cause" (e.g., I didn't do well on the exam because I didn't study), scientific studies often must be content with measuring a weaker notion of causal effect. In the population Q, suppose there are $n_k(\ell)$ units for which $Y_\ell = k$, $\ell = 1, 2$; $k = 0, 1$. That is, $n_k(\ell)$ is the number of units whose response would be $k$ if they were exposed to treatment $\ell$. If $n_{+}$ denotes the total number of units in Q, then the vector

$$q(\ell) = (q_0(\ell), q_1(\ell)) = (n_0(\ell)/n_{+},\; n_1(\ell)/n_{+})$$

gives the distribution of responses under treatment $\ell$ for the entire population Q. A weaker definition of causal effect of treatment 1 relative to 2 is based on the comparison of the two response distributions $q(1)$ and $q(2)$. If, for example, $q_1(1) > q_1(2)$, then the population-level causal effect of 1 relative to 2 is to increase the proportion of units in Q for which $Y = 1$. We shall call $q(\ell)$ the causal parameter of the study. In terms of the distribution of $Y_\ell$ over Q, $q_k(\ell)$ may be expressed as

$$q_k(\ell) = P(Y_\ell = k).$$



Consider a simple randomized experiment. A random sample of units from Q is exposed to treatment 1 and the values of $Y_1$ are obtained for them. This gives us an estimate of $q(1)$ which has an accuracy that depends on the size of the random sample. A second random sample of units from Q is exposed to treatment 2 and the values of $Y_2$ are obtained for them. This yields an estimate of $q(2)$. A comparison of these two estimated causal parameters is a form of causal inference because it allows us to say that treatment 1 causes a change in the entire distribution of responses for the units in Q relative to the distribution of responses under treatment 2 by a given estimated amount.
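The simple randomized experiment just described can be sketched in a few lines. The population below is simulated, and the response probabilities 0.7 and 0.4, the population size and the sample size are all assumptions chosen for illustration. The two samples are drawn disjointly so that no unit is exposed to both treatments.

```python
import random

rng = random.Random(1)  # fixed seed for reproducibility

# Hypothetical population Q: each unit carries (Y1, Y2).
N = 100_000
Q = [(int(rng.random() < 0.7), int(rng.random() < 0.4)) for _ in range(N)]

def q1_true(level):
    """True causal parameter q_1(level): share of units in Q with Y_level = 1."""
    return sum(unit[level - 1] for unit in Q) / N

# Two disjoint random samples from Q.
n = 5_000
sampled = rng.sample(Q, 2 * n)
group1, group2 = sampled[:n], sampled[n:]

# Only Y1 is recorded in group 1 and only Y2 in group 2.
q1_hat_1 = sum(y1 for y1, _ in group1) / n
q1_hat_2 = sum(y2 for _, y2 in group2) / n

print(f"q_1(1): true {q1_true(1):.3f}, estimate {q1_hat_1:.3f}")
print(f"q_1(2): true {q1_true(2):.3f}, estimate {q1_hat_2:.3f}")
print(f"estimated population-level effect: {q1_hat_1 - q1_hat_2:.3f}")
```

The estimates approach the true causal parameters as `n` grows, which is the sense in which accuracy depends on the size of the random sample.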

Homogeneous populations

A population-level causal inference is weaker than a unit-level causal inference because it does not allow us to say how treatments change the response of any single unit in Q except in one very special and important circumstance which we now discuss. If Q is such that $Y_1$ takes on a single value for all units and $Y_2$ also takes on a single value (that is possibly different from that of $Y_1$) then Q will be said to have homogeneous responses for treatments 1 and 2. We shall refer to such a Q as a "homogeneous population". When Q is a homogeneous population, then the population-level causal inference is equivalent to unit-level causal inferences for all the units in Q. For example, if

$$q(1) = (q_0(1), q_1(1)) = (0, 1)$$

and if

$$q(2) = (q_0(2), q_1(2)) = (1, 0)$$

then treatment 1 changes the response of every unit in Q from $Y = 0$ under treatment 2 to $Y = 1$.

Earlier we distinguished between individual attributes and causal agents. Attributes can be used to partition Q into subpopulations. Finding homogeneous subpopulations plays an essential role in much of scientific research. In the physical sciences the search for "identical initial conditions" is really the search for collections of units (i.e., populations) with homogeneous responses. An "ideal covariate" is an attribute (or set of attributes) which may be observed for each unit in Q prior to the onset of the treatments and which defines subpopulations of Q, each of which has homogeneous responses to the relevant treatment conditions. In practice, of course, we must often settle for less-than-ideal covariates which only define subpopulations that are relatively homogeneous.



Intermediate-level causal effects



There is an intermediate level between unit- and population-level causal inferences. Consider all of the units in Q which respond with the value $k$ under treatment 2. We may ask, in what way does treatment 1 change the responses of these units? That is, what is the distribution of values of $Y_1$ for the units in Q with $Y_2 = k$? The answer to this question is a more detailed causal inference than a population-level causal inference, and yet it still describes sets of units in Q, so that it is less detailed than a unit-level causal inference. This intermediate-level causal inference leads naturally to the notion of a causal-effect table for Q. Let

$n_{kk'}$ = number of units in Q for which $Y_1 = k$ and $Y_2 = k'$.

Since $n_{+}$ is the total number of units in Q,

$$q_{kk'} = n_{kk'} / n_{+} \tag{5}$$

is the proportion of units in Q for which $Y_1 = k$ and $Y_2 = k'$. In terms of the joint distribution of $Y_1$ and $Y_2$ over Q, $q_{kk'}$ may be expressed as

$$q_{kk'} = P(Y_1 = k, Y_2 = k').$$

Let $q$ be the 2x2 matrix with entries $q_{kk'}$. Then the row totals of $q$ yield the distribution of responses under treatment 1, i.e., $q(1)$, and the column totals of $q$ yield the distribution of responses under treatment 2, i.e., $q(2)$. We call $q$ the causal-effect table for treatments 1 and 2 in Q. Table 1 is a causal-effect table.

Table 1 about here
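A causal-effect table and its margins can be computed directly once the full vector $(Y_1, Y_2)$ is known for every unit, which is possible only in a constructed example. The eight units below are invented for illustration, and exact fractions are used so that the row and column totals can be checked by hand.

```python
from fractions import Fraction

# Hypothetical population Q: each unit is its full vector (Y1, Y2).
Q = [(1, 0), (1, 0), (1, 1), (0, 0), (1, 0), (0, 1), (1, 1), (0, 0)]
n_total = len(Q)

# q[k][k2] = proportion of units with Y1 = k and Y2 = k2, as in equation (5).
q = [[Fraction(sum(1 for unit in Q if unit == (k, k2)), n_total)
      for k2 in (0, 1)]
     for k in (0, 1)]

row_totals = [q[k][0] + q[k][1] for k in (0, 1)]      # q(1) = (q_0(1), q_1(1))
col_totals = [q[0][k2] + q[1][k2] for k2 in (0, 1)]   # q(2) = (q_0(2), q_1(2))

print(q)            # cells 1/4, 1/8 (row Y1=0) and 3/8, 1/4 (row Y1=1)
print(row_totals)   # q(1) = (3/8, 5/8)
print(col_totals)   # q(2) = (5/8, 3/8)
```

The interior of `q` is what an intermediate-level causal inference describes; the margins alone are the population-level causal parameters.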



As discussed earlier it is often possible to estimate the marginal distributions $q(\ell)$, $\ell = 1, 2$, using randomization. However, it is generally not possible to estimate the joint distribution $q$. This problem arises because of our fundamental assumption that $Y_1$ and $Y_2$ can never be simultaneously observed on any unit. The one situation in which $q$ can be estimated arises when Q is a population with homogeneous responses. The causal-effect table for a homogeneous population is illustrated in Table 2 and we see there that $q$ is uniquely determined by the marginal distributions $q(1)$ and $q(2)$.

Table 2 about here



Table 2

Causal-effect table for a population that has homogeneous responses under treatments 1 and 2; Y_1 = 0, Y_2 = 1.

             Y_2 = 0    Y_2 = 1    Total
Y_1 = 0         0          1         1
Y_1 = 1         0          0         0
Total           0          1         1



When Q is not homogeneous, it may be possible to decompose it into homogeneous subpopulations, and compute the causal-effect table for each of these subpopulations. It is then possible to accumulate these subpopulation causal-effect tables to obtain the overall causal-effect table for Q. If it is not possible to find homogeneous subpopulations of Q then it is not possible to form the causal-effect table for Q from its margins because the entries are not determined by $q(1)$ and $q(2)$.

Since we rarely encounter perfectly homogeneous populations in practice, we may raise the question of how constrained $q$ is if we only know (or can estimate) the causal parameters $q(1)$ and $q(2)$. The kinds of constraints that exist are easily conveyed by a few examples; these are given in Table 3. The margins of these causal-effect tables are considered to be known and fixed, and the ranges of possible values for the cell entries are given in parentheses. It is evident that if one of the cells in each margin is near one, $q$ is highly constrained. When none of the proportions in $q(1)$ and $q(2)$ is large, $q$ is less constrained. The general rule for calculating the ranges of values for these tables is given by:

$$\max(0,\; q_k(1) + q_{k'}(2) - 1) \;\le\; q_{kk'} \;\le\; \min(q_k(1),\; q_{k'}(2)) \tag{6}$$
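Rule (6) is easy to evaluate mechanically. The sketch below is an illustration rather than part of the original paper: the function name `cell_bounds` is hypothetical, and the margins are entered as whole percentages (so the constant 1 in rule (6) becomes 100) to keep the arithmetic exact. The three sets of margins are chosen to match the percentages that appear in Table 3.

```python
def cell_bounds(q1, q2, total=100):
    """Range of each cell q_{kk'} allowed by rule (6).

    q1 and q2 are the margins q(1) and q(2), given here as percentages
    that each sum to `total`. Rows index Y1, columns index Y2.
    """
    return [[(max(0, q1[k] + q2[k2] - total), min(q1[k], q2[k2]))
             for k2 in (0, 1)]
            for k in (0, 1)]

# One large cell in each margin: the table is highly constrained.
print(cell_bounds((90, 10), (10, 90)))
# [[(0, 10), (80, 90)], [(0, 10), (0, 10)]]

# A large cell in only one margin.
print(cell_bounds((90, 10), (50, 50)))
# [[(40, 50), (40, 50)], [(0, 10), (0, 10)]]

# No large proportions: every cell can range widely.
print(cell_bounds((50, 50), (50, 50)))
# [[(0, 50), (0, 50)], [(0, 50), (0, 50)]]
```

The width of each range never exceeds the smallest margin entry involved, which is why a margin entry near one (and hence one near zero) pins the table down.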



Table 3

Ranges of possible cell entries (in parentheses) for three causal-effect tables with fixed margins; entries are percentages of units in Q.

             Y_2 = 0    Y_2 = 1    Total
Y_1 = 0      (0,10)     (80,90)      90
Y_1 = 1      (0,10)     (0,10)       10
Total          10         90

             Y_2 = 0    Y_2 = 1    Total
Y_1 = 0      (40,50)    (40,50)      90
Y_1 = 1      (0,10)     (0,10)       10
Total          50         50

             Y_2 = 0    Y_2 = 1    Total
Y_1 = 0      (0,50)     (0,50)       50
Y_1 = 1      (0,50)     (0,50)       50
Total          50         50



The selection variable and the role of randomization

The causal-effect table gives the joint distribution of $Y_1$ and $Y_2$ over Q. The data in a causal effect study consist of values of $(Y_S, S)$ for each unit in the study. The joint distribution of $(Y_S, S)$ over Q does not determine the joint distribution of $(Y_1, Y_2)$. We may decompose the joint distribution of $(Y_S, S)$ into the conditional distribution of $Y_S$ given S and the marginal distribution of S. The conditional distribution of $Y_S$ given S is specified by the following probabilities:

$$t_k(\ell) = P(Y_S = k \mid S = \ell), \quad \ell = 1, 2 \text{ and } k = 0, 1. \tag{7}$$

The marginal distribution of S is specified by the following probabilities:

$$t(\ell) = P(S = \ell), \quad \ell = 1, 2. \tag{8}$$

The fundamental problem in a population-level causal inference (and therefore of all stronger forms of causal inferences) is the estimation of $q(\ell)$ for $\ell = 1, 2$. However, the only data we can obtain in a causal study allow us to estimate the conditional probabilities given in (7). Thus, a question of paramount importance in causal inference is: when are $q_k(\ell) = P(Y_\ell = k)$ and $t_k(\ell) = P(Y_S = k \mid S = \ell)$ equal? Since $Y_S = Y_\ell$ whenever $S = \ell$, we are led to seek conditions under which the following equation holds:

$$P(Y_\ell = k \mid S = \ell) = P(Y_\ell = k). \tag{9}$$

There are two very important cases where equation (9) holds: random assignment and homogeneous populations. We discuss each of these briefly in turn.




Random assignment: If S is statistically independent of $Y_\ell$ then equation (9) must hold by definition of statistical independence. How can S be made to be independent of $Y_\ell$? There is no way to be absolutely sure that S is independent of $Y_\ell$. However, the process of "random" assignment of the values of S to the units in Q makes it plausible to assume that equation (9) holds if Q is large. Thus, under randomization we have

$$P(Y_1 = k, Y_2 = k' \mid S = \ell) = P(Y_1 = k, Y_2 = k') \tag{10}$$

and equation (9) follows. The statistical independence expressed in (10) is a very important point in the justification of randomization but it is apparently not appreciated by numerous writers on the subject. For example, it is often asserted that there is a "difficulty" in resolving randomization and the Bayesian/likelihood/modelling framework (Basu, 1980; Kempthorne, 1976; Kruskal, 1980). However, equation (9) is a fundamental one for both Bayesians and frequentists because it makes a parameter that can be estimated from data (i.e., $P(Y_S = k \mid S = \ell)$) equal to the causal parameters of interest, $q_k(\ell)$.

One source of confusion is that equation (10) does not imply that the observed value $Y_S$ is independent of S. That is, the following equation does not hold in general:

$$P(Y_S = k \mid S) = P(Y_S = k). \tag{11}$$

Equation (11) does hold under the null hypothesis that $P(Y_\ell = k)$ does not depend on the level of exposure, $\ell$. Of course, that fact is usually not of much interest to the Bayesian who wants to summarize evidence [...]
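Equation (9) can be checked numerically on simulated data. The sketch below is an illustration under stated assumptions: the population, the response probabilities 0.6 and 0.3, and the self-selection mechanism (units with $Y_1 = 1$ prefer treatment 1) are all invented. It contrasts random assignment, where the estimable quantity $P(Y_S = 1 \mid S = 1)$ is close to $q_1(1)$, with a self-selected assignment where it is not.

```python
import random

rng = random.Random(2)  # fixed seed for reproducibility
N = 50_000

# Hypothetical population Q: (Y1, Y2) for each unit.
Q = [(int(rng.random() < 0.6), int(rng.random() < 0.3)) for _ in range(N)]

def cond_prob(S, level):
    """P(Y_S = 1 | S = level), computed from the observable data only."""
    exposed = [unit[level - 1] for unit, s in zip(Q, S) if s == level]
    return sum(exposed) / len(exposed)

def marg_prob(level):
    """Causal parameter q_1(level) = P(Y_level = 1) over all of Q."""
    return sum(unit[level - 1] for unit in Q) / N

# Random assignment: S ignores (Y1, Y2).
S_random = [rng.choice([1, 2]) for _ in Q]

# Self-selection (an assumed mechanism): units with Y1 = 1 favor treatment 1.
S_selected = [1 if rng.random() < (0.8 if y1 == 1 else 0.2) else 2
              for y1, _ in Q]

print(f"q_1(1)                       = {marg_prob(1):.3f}")
print(f"P(Y_S=1 | S=1), randomized   = {cond_prob(S_random, 1):.3f}")
print(f"P(Y_S=1 | S=1), self-chosen  = {cond_prob(S_selected, 1):.3f}")

# Even under randomization, the observed Y_S still depends on S
# whenever the two treatments differ, which is why (11) fails in general.
print(f"P(Y_S=1 | S=2), randomized   = {cond_prob(S_random, 2):.3f}")
```

With these assumed probabilities the self-selected conditional probability overstates $q_1(1)$ substantially, while the randomized one differs from it only by sampling error.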






Homogeneous population: If Q is a homogeneous population, then equation (9) holds trivially without any assumption about the dependence or independence of S and $Y_\ell$. This is because in a homogeneous population $Y_\ell$ is constant over units and constants are always independent of every random variable. Thus, for homogeneous populations randomization is not necessary for drawing population-level causal inferences.

In a nonrandomized study, it is often not believable that S is statistically independent of $Y_\ell$, so that equation (9) may not hold. Thus, in a nonrandomized study the observed values of $Y_S$ are not representative of the marginal distribution of $Y_\ell$ over all of Q. However, if Q is a homogeneous population, then equation (9) must hold trivially. Covariates defining subpopulations play a crucial role in nonrandomized studies of causal effects. First, the subpopulations defined by them can be nearly homogeneous, in which case equation (9) almost holds within each. Second, within each subpopulation it may be plausible to accept the assumption of conditional independence between $Y_\ell$ and S; at best, there may be no data to contradict this assumption. The next section discusses this issue in more detail.



The Role of Covariates



Suppose that Q can be partitioned into strata on the basis of a covariate X. We may then consider the possibility that equation (9) holds in each X-stratum even though equation (9) does not hold for all of Q. That is, we may ask whether

$$P(Y_\ell = k \mid S = \ell, X = x) = P(Y_\ell = k \mid X = x) \tag{11}$$

for all values of $\ell$, $k$ and $x$. As mentioned earlier, there are two reasons why we may be willing to assume (11) even if we are not willing to assume (9). The first occurs when X is an ideal covariate and all the X-strata are themselves populations with homogeneous responses. Then we know that (11) holds automatically. The second occurs when we know or are willing to assume that S and $Y_\ell$ are independent given X. We may be willing to make this assumption for one of two reasons. The first is that we actually randomly assigned the values of S within each stratum. The second is that we may be willing to make this assumption because there is nothing in the data that will contradict it. This is a subtle point and one that needs to be elaborated. If we assumed that S had been randomly assigned and was therefore independent of $Y_\ell$, then this assumption could be immediately contradicted by looking at the distribution of X given S. If S had been randomly assigned, then X and S would be independent so that

$$P(X = x \mid S = \ell) = P(X = x). \tag{12}$$

If we examined the distribution of X given $S = \ell$ and saw that it did vary with the value of $\ell$, then we would have evidence that S was not randomly assigned over all of Q and therefore that equation (9) does not hold. However, if we assumed that S was randomly assigned within each X-stratum we could not then use the observed distributions of X given S to disprove this assumption.
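The check based on equation (12) amounts to comparing the distribution of X across the levels of S. The following sketch uses a simulated binary covariate and two invented assignment mechanisms (one that ignores X and one that depends on X); the probabilities 0.9 and 0.2 are assumptions. A large gap between $P(X = 1 \mid S = 1)$ and $P(X = 1 \mid S = 2)$ is evidence against random assignment over all of Q.

```python
import random

rng = random.Random(4)  # fixed seed for reproducibility
N = 20_000

# Hypothetical binary covariate X for each unit.
X = [int(rng.random() < 0.5) for _ in range(N)]

def share_x1(S, level):
    """Estimate of P(X = 1 | S = level)."""
    xs = [x for x, s in zip(X, S) if s == level]
    return sum(xs) / len(xs)

# Assignment that ignores X (random over all of Q).
S_random = [rng.choice([1, 2]) for _ in X]

# Assignment that depends on X (random only within X-strata).
S_by_x = [1 if rng.random() < (0.9 if x else 0.2) else 2 for x in X]

gap_random = abs(share_x1(S_random, 1) - share_x1(S_random, 2))
gap_by_x = abs(share_x1(S_by_x, 1) - share_x1(S_by_x, 2))

print(f"gap in P(X=1 | S) when S ignores X:    {gap_random:.3f}")
print(f"gap in P(X=1 | S) when S depends on X: {gap_by_x:.3f}")
```

Note the asymmetry described in the text: the second mechanism is exposed as non-random over Q, but nothing in the distribution of X given S can contradict the claim that it is random within each X-stratum.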






Now suppose that equation (11) holds. We may use it to obtain a
basic formula for the causal parameter, $q_k(\ell)$. We have

$$q_k(\ell) = P(Y_\ell = k) = \sum_x P(Y_\ell = k \mid X = x)\, P(X = x) \qquad (13)$$

so that

$$q_k(\ell) = \sum_x P(Y_\ell = k \mid S = \ell, X = x)\, P(X = x). \qquad (14)$$

Equation (13) is a basic fact of probabilities. Equation (14) relates
two quantities that can be estimated, i.e. $P(Y_\ell = k \mid S = \ell, X = x)$ and
$P(X = x)$, to the causal parameters. Thus, if equation (11) holds we can
estimate the causal parameters and draw population-level causal inferences.
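As an illustration of formula (14), the following minimal sketch estimates $q_k(\ell)$ from unit-level data by weighting stratum-specific response rates with the distribution of X. The function name `standardized_q`, the record layout `(x, s, y)` and the toy data are assumptions made for this example; they do not come from the paper.

```python
from collections import defaultdict

def standardized_q(records, level, k):
    """Estimate q_k(level) = sum_x P(Y=k | S=level, X=x) * P(X=x).

    records is an iterable of (x, s, y) triples (hypothetical layout).
    """
    records = list(records)
    n_x = defaultdict(int)     # units with X = x, all exposure levels
    n_sx = defaultdict(int)    # units with X = x and S = level
    n_ksx = defaultdict(int)   # units with X = x, S = level and Y = k
    for x, s, y in records:
        n_x[x] += 1
        if s == level:
            n_sx[x] += 1
            if y == k:
                n_ksx[x] += 1
    total = len(records)
    q = 0.0
    for x, count in n_x.items():
        if n_sx[x] == 0:
            # Formula (14) needs exposed units in every stratum.
            raise ValueError(f"no units with S={level} in stratum X={x}")
        q += (n_ksx[x] / n_sx[x]) * (count / total)
    return q

# Toy data: stratum X=0 has 4 units, stratum X=1 has 6 units.
records = [
    (0, 1, 1), (0, 1, 0), (0, 2, 1), (0, 2, 1),
    (1, 1, 0), (1, 1, 0), (1, 1, 0), (1, 1, 1), (1, 2, 1), (1, 2, 0),
]
q1_level1 = standardized_q(records, level=1, k=1)
q1_level2 = standardized_q(records, level=2, k=1)
```

The estimate is only meaningful as a causal parameter when the conditional independence in (11) is plausible; the code itself cannot check that.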



3. Causal Inference in Retrospective Case-Control Studies 

The structure of a retrospective case-control study is considerably
different from the general prospective study discussed in Section 2.
In a case-control study a population of people is divided into those
who have a particular symptom or disease of interest (i.e., the "cases")
and those who do not have the symptom or disease (i.e., the "controls").
Samples of cases and controls are selected from this population and
information about each selected person is obtained to ascertain:
(a) the level of exposure to the particular causal agent of interest
(b) other medically relevant information which may be used to
define subpopulations of units.



The response variable for a case-control study is the dichotomous
variable that indicates whether the unit is a "case" or a
"control", i.e.,

$$Y = \begin{cases} 1 & \text{if unit is a case} \\ 0 & \text{if unit is a control.} \end{cases}$$



Case-control studies are retrospective because they begin at the end
point of a prospective study (i.e., observations of the response
variable for each unit in the study) and then look back in time to
discover the level of causal agent to which each unit has been
exposed (i.e., the value of the selection indicator S). In addition to
this fundamental difference between case-control and prospective
studies, there are two other differences that should be mentioned.

First, since the investigator can only collect data on prior exposure
to the causal agents of interest, it is impossible to employ
randomization to assign units to levels of the causal agent.
Thus case-control studies are never randomized. Prospective studies,
on the other hand, may or may not employ randomization depending on
the amount of control that is possible. Second, the populations
studied in case-control studies usually consist of survivors only,
because it is often impossible to obtain comparable data on
individuals who are deceased. This limitation can have a serious
effect on the interpretability of the results of a case-control study.
We shall assume for the moment that the populations considered are not
subject to mortality. We shall return to this point in the discussion
of the example in Section 4.



Although in principle it is almost always possible to formulate a
prospective version of a case-control study, it is often much more
expensive than the case-control study. There are several reasons for this:
(a) prospective studies often require large sample sizes,
especially when the "cases" are rare (e.g., when $Y_S = 1$ represents a rare
disease), (b) prospective studies often involve long time spans before
relevant data become available. Hence it is likely that case-control
studies will always be an attractive possibility for many types of
scientific investigations, especially in the early stages of the research.
It is therefore important to know their limitations, to design them as
well as possible and to analyze the data collected in such studies
correctly. Our goal in the present paper is to illuminate all of these
points by applying the model for causal inference developed in Section 2
to case-control studies.



The standard two-way table and why it is misleading

In analysing data from case-control studies it is customary to
form and draw conclusions from the two-way table of counts illustrated
in Table 4. We assume that this table was formed by randomly sampling
$m_{1+}$ "cases" and $m_{0+}$ "controls" from the population.



Table 4 about here



In Table 4, $m_{k\ell}$ is the number of units in the study for which $Y_S = k$
and $S = \ell$. For example, $m_{12}$ is the number of "cases" in the study
that were observed at the 2nd level of exposure to the causal
agent. Before examining this table of sample data, let us
consider the population table that underlies it. This population table
gives the population proportion of people for which $S = \ell$ among all those
for which $Y_S = k$, $k = 0, 1$; $\ell = 1, 2$. These population values are denoted by

$$r_{k\ell} = P(S = \ell \mid Y_S = k) \qquad (15)$$

and arrayed as a population table in Table 5. The sample ratio

$$m_{k\ell} / m_{k+} \qquad (16)$$

estimates $r_{k\ell}$. We shall call the $r_{k\ell}$ the retrospective probabilities of
the study.



Table 5 about here



In this development we must emphasize the importance of representing
the observed value of the response as $Y_S$. For example, in (15) it would
be incorrect to condition on $Y_\ell = k$ since $Y_\ell$ is the response
made if exposed to treatment level $\ell$ whereas $Y_S$ is the observed
response. Because $Y_S$ is being conditioned on in Table 5, it is sometimes
said that in a case-control study exposure is the dependent variable
and diagnosis (i.e., case or control) is the independent variable.
This description confuses the scientific question of interest, and we
will not describe the situation in these terms.



                  Level-of-exposure
                  S=1        S=2        Total
"cases"           m_11       m_12       m_1+
"controls"        m_01       m_02       m_0+
Total             m_+1       m_+2       m_++

Table 4: The standard 2-way sample table showing the distribution
of cases and controls observed at each level of exposure
to the causal agent.




                      Level-of-exposure
                      S=1                        S=2                        Total
"cases" Y_S=1         r_11 = P(S=1 | Y_S=1)      r_12 = P(S=2 | Y_S=1)      1
"controls" Y_S=0      r_01 = P(S=1 | Y_S=0)      r_02 = P(S=2 | Y_S=0)      1

Table 5. The population table of retrospective
probabilities r_kl that underlies the sample
table in Table 4.



If we consider the weakest level of causal inference, i.e., a population-
level causal inference, then the causal parameters are the marginal
probabilities $P(Y_1 = 1)$ and $P(Y_2 = 1)$. Thus, the retrospective probabilities
in (15) are not, in themselves, of any causal interest, because, at
the very least, they describe the wrong events. However, by applying
the usual rules of probability, we may reverse the roles of S and $Y_S$ in
(15) and obtain more interesting probabilities. This reversal is the
usual justification for ever looking at Table 4.

Relating retrospective and prospective probabilities 

To reverse the roles of S and $Y_S$ we make use of Bayes theorem to
obtain

$$P(Y_S = k \mid S = \ell) = P(S = \ell \mid Y_S = k)\,\frac{P(Y_S = k)}{P(S = \ell)}. \qquad (17)$$

However,

$$P(Y_S = k \mid S = \ell) = P(Y_\ell = k \mid S = \ell), \qquad (18)$$

so it follows that

$$P(Y_\ell = k \mid S = \ell) = r_{k\ell}\, a_k\, b_\ell, \qquad (19)$$

where $a_k = P(Y_S = k)$ and $b_\ell = 1 / P(S = \ell)$.

Hence, in order to transform the retrospective probabilities $r_{k\ell}$ in
(15) and Table 5 into the more interesting "prospective"
probabilities, $P(Y_\ell = k \mid S = \ell)$, we need only multiply the entries in
Table 5 by a row factor (i.e., $a_k$) and a column factor (i.e., $b_\ell$).
We have illustrated the array of "prospective" probabilities of (18)
in Table 6.



Table 6 about here



Note that

$$P(S = \ell) = \sum_k r_{k\ell}\, P(Y_S = k),$$

so that the prospective probabilities in (19) can be calculated from
knowledge of a) the retrospective probabilities $r_{k\ell}$ and b) the overall
proportions of cases and controls in the population, $a_k = P(Y_S = k)$.



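The cross-product ratio for Table 5 may be expressed as:

$$\alpha^* = \frac{r_{12}\, r_{01}}{r_{11}\, r_{02}} = \frac{P(S=2 \mid Y_S=1) / P(S=2 \mid Y_S=0)}{P(S=1 \mid Y_S=1) / P(S=1 \mid Y_S=0)} \qquad (20)$$

and the cross-product ratio for Table 6 may be expressed as:

$$\alpha^{**} = \frac{P(Y_2=1 \mid S=2) / P(Y_1=1 \mid S=1)}{P(Y_2=0 \mid S=2) / P(Y_1=0 \mid S=1)}. \qquad (21)$$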



Because Tables 5 and 6 are related via row and column multiplication,
it is well-known (e.g., Bishop, Fienberg and Holland (1975)) that

$$\alpha^* = \alpha^{**}. \qquad (22)$$
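The row-and-column rescaling can be checked numerically. The sketch below converts a table of retrospective probabilities into prospective ones with Bayes theorem and confirms that the cross-product ratio is unchanged. The function names, the example table `r`, and the case fraction of 0.1 are all hypothetical values chosen for illustration; in particular, the population case fraction $P(Y_S = 1)$ cannot be estimated from the case-control sample itself and must be supplied from elsewhere.

```python
def prospective_table(r, case_fraction):
    """Convert r[k][l] = P(S=l | Y_S=k) into P(Y_S=k | S=l).

    r is a dict {1: [...], 0: [...]} indexed by exposure level;
    case_fraction is an assumed value of P(Y_S = 1).
    """
    a = {1: case_fraction, 0: 1.0 - case_fraction}            # row factors a_k
    levels = range(len(r[1]))
    p_s = [r[1][l] * a[1] + r[0][l] * a[0] for l in levels]    # P(S = l) = 1 / b_l
    return {k: [r[k][l] * a[k] / p_s[l] for l in levels] for k in (0, 1)}

def cross_product_ratio(table, level=1, reference=0):
    """Odds of row 1 versus row 0 at `level`, relative to `reference`."""
    return (table[1][level] / table[0][level]) / (table[1][reference] / table[0][reference])

# Hypothetical retrospective probabilities: each row sums to 1.
r = {1: [0.2, 0.8], 0: [0.5, 0.5]}
p = prospective_table(r, case_fraction=0.1)

retro_ratio = cross_product_ratio(r)
prosp_ratio = cross_product_ratio(p)
```

Changing `case_fraction` changes every entry of the prospective table but leaves `prosp_ratio` equal to `retro_ratio`, which is the content of (22).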



                      Level-of-exposure
                      S=1                  S=2
"cases" Y_S=1         P(Y_1=1 | S=1)       P(Y_2=1 | S=2)
"controls" Y_S=0      P(Y_1=0 | S=1)       P(Y_2=0 | S=2)
Total                 1                    1

Table 6. The population table of prospective probabilities
that may be derived from Table 5 by row and column
multiplication.



Population-level causal Inferences 

Now let us return to the question of making a population-level
causal inference about the effect of the causal agent on the
probability of becoming a "case." The parameters of interest in such a
causal inference are the causal parameters $q_k(\ell) = P(Y_\ell = k)$ or, equivalently,
the odds associated with these probabilities, i.e.

$$\beta(\ell) = \frac{P(Y_\ell = 1)}{P(Y_\ell = 0)}, \qquad (23)$$

$\ell = 1, 2$. The odds in (23) for $\ell = 2$ relative to $\ell = 1$ gives the odds ratio

$$\alpha = \frac{\beta(2)}{\beta(1)} = \frac{P(Y_2 = 1) / P(Y_1 = 1)}{P(Y_2 = 0) / P(Y_1 = 0)} = \frac{q_1(2) / q_1(1)}{q_0(2) / q_0(1)}. \qquad (24)$$

Even though $\alpha$ represents less information than both $\beta(1)$ and $\beta(2)$, interest
often focuses on the odds ratio in case-control studies. Certainly $\alpha$ does
give a measure of the change in $q(2)$ relative to $q(1)$.

If we could assume that S and $(Y_1, Y_2)$ were independent, then it would
follow from (21) and (22) that $\alpha$ and $\alpha^*$ would be equal. This would justify
examining Table 4, because the cross-product ratio directly estimated by
this table (i.e. $\alpha^*$) would be equal to the cross-product ratio of the
causal parameters (i.e. $\alpha$). However, case-control studies are non-
randomized studies so that randomization cannot be a generally
satisfactory basis for assuming that S and $(Y_1, Y_2)$ are independent.
Furthermore, by examining the distribution of a covariate X given $S = \ell$, we
can often convince ourselves in a case-control study that S was not even
approximately randomly assigned. It is essential in case-control
studies to examine more detailed aspects of the data than those which are
summarized by Table 4 in order to have some hope of drawing reasonable conclusions.

The odds ratio $\alpha^*$ in a case-control study may not equal $\alpha$ due
to the self-selection of individuals into exposure categories. We
conclude that basing the analysis of a case-control study on Table 4 is
potentially misleading because it ignores the possibility of bias due
to self-selection into the exposure conditions. We hasten to point out
that since population-level causal inferences are the weakest of the
three types of causal inferences we discussed in Section 2, it follows
that if population-level causal inferences are impossible from the data
in Table 4, so are all other types of causal inferences.

The role of covariates in case-control studies

If there is a covariate (or set of covariates) X which is
measured on each unit in the study, then we may form a table like Table 4
for each value of X. Let $m_{k\ell x}$ be the number of units in the study for
which $Y_S = k$, $S = \ell$ and $X = x$. These are arrayed in Table 7 for $X = x$.

Table 7 about here



The ratios

$$m_{k\ell x} / m_{k+x} \qquad (25)$$

estimate the population retrospective probabilities

$$r_{k\ell x} = P(S = \ell \mid Y_S = k, X = x). \qquad (26)$$




                      Value of X = x
                      Level-of-exposure
                      S=1         S=2         Total
"cases" Y_S=1         m_11x       m_12x       m_1+x
"controls" Y_S=0      m_01x       m_02x       m_0+x
Total                 m_+1x       m_+2x       m_++x

Table 7: The distribution of cases and controls in the sample
observed at each level of exposure to the causal agent, for X = x.





'and 



(27) 




$$b_{\ell x} = 1 / P(S = \ell \mid X = x). \qquad (29)$$

By the same argument given for $\alpha^*$ and $\alpha^{**}$ we have



If it is reasonable to assume that $(Y_1, Y_2)$ and S are conditionally
independent given $X = x$, then

$$\alpha^*_x = \alpha_x \qquad (33)$$

where

$$\alpha_x = \frac{P(Y_2 = 1 \mid X = x) / P(Y_1 = 1 \mid X = x)}{P(Y_2 = 0 \mid X = x) / P(Y_1 = 0 \mid X = x)}. \qquad (34)$$

On the other hand, the cross-product ratio that is determined by the
causal parameters is $\alpha$ in equation (24). The relationship between $\alpha$ and
the values of $\alpha_x$ is not a simple one due to the nonlinear form of the
cross-product ratio. For example, the average value of $\alpha_x$ over the
distribution of X does not equal $\alpha$ in general. There is no simple
analogue to formula (14) for the retrospective cross-product ratios.

Suppose $\alpha^*_x = \alpha_0$ for all values of x. If this happens then we shall
say that the data in the study exhibit a constant cross-product ratio,
i.e., $\alpha^*_x$ is constant across all values of x. If we
are willing to further assume that $(Y_1, Y_2)$ and S are conditionally
independent given X then

$$\alpha_x = \alpha_0. \qquad (35)$$




Even when (35) holds there is still no simple relationship between
$\alpha_0$ and $\alpha$. The general formula relating $\alpha_0$ to $\alpha$ is given in (36).



r U 



»^qo(l|x) + 



P(X-x) 



(36) 



where

$$q_k(\ell \mid x) = P(Y_\ell = k \mid X = x). \qquad (37)$$

We note that the causal parameters appear in (36) along with
their conditional versions $q_k(\ell \mid x)$. The example in Table 8 shows that
$\alpha_0$ and $\alpha$ need not be equal.

Table 8 about here



All is not lost however, because $\alpha_0$ is a causally interesting
quantity itself. It is the factor by which the odds for $Y_2 = 1$ is
increased over the odds for $Y_1 = 1$ in each X-stratum of Q. Thus $\alpha_0$
is a useful parameter that has causal relevance in
each of the subpopulations. Since $\alpha_x$ is specific to the X-strata
of Q, it provides causal inferences about the effects of the levels of
the causal agent at a more detailed level than population-
level causal inferences. However, these inferences are not as strong as the intermediate-
level causal inferences discussed in Section 2.



Table 8: Example showing that α ≠ α_0

X takes on two values, X=1 and X=2.

X = 1         P(Y_1=k | x=1)     P(Y_2=k | x=1)
k = 1         1/10               5/10
k = 0         9/10               5/10
Total         1                  1

X = 2         P(Y_1=k | x=2)     P(Y_2=k | x=2)
k = 1         1/100              1/12
k = 0         99/100             11/12
Total         1                  1

α_1 = α_2 = 9, therefore α_0 = 9.

If P(X=1) = .1 and P(X=2) = .9 then

              P(Y_1=k)           P(Y_2=k)
k = 1         19/1000            150/1200
k = 0         981/1000           1050/1200
Total         1                  1

If P(X=1) = .5 and P(X=2) = .5 then

              P(Y_1=k)           P(Y_2=k)
k = 1         11/200             70/240
k = 0         189/200            170/240
Total         1                  1
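The arithmetic behind Table 8 can be reproduced exactly. The sketch below takes the conditional probabilities from the table, confirms that the cross-product ratio is 9 in both X-strata, and then computes the population-level ratio $\alpha$ for the two distributions of X used in the table. The function and variable names are chosen for this illustration only.

```python
from fractions import Fraction as F

# Conditional probabilities P(Y_l = 1 | X = x) taken from Table 8.
p1 = {1: F(1, 10), 2: F(1, 100)}   # first level of the causal agent
p2 = {1: F(5, 10), 2: F(1, 12)}    # second level of the causal agent

def odds(p):
    return p / (1 - p)

def stratum_ratio(x):
    """Cross-product ratio alpha_x within the stratum X = x."""
    return odds(p2[x]) / odds(p1[x])

def population_ratio(px):
    """Cross-product ratio alpha of the marginal causal parameters,
    for a distribution px = {x: P(X = x)}."""
    q1 = sum(p1[x] * px[x] for x in px)   # P(Y_1 = 1)
    q2 = sum(p2[x] * px[x] for x in px)   # P(Y_2 = 1)
    return odds(q2) / odds(q1)

alpha_a = population_ratio({1: F(1, 10), 2: F(9, 10)})
alpha_b = population_ratio({1: F(1, 2), 2: F(1, 2)})
print(stratum_ratio(1), stratum_ratio(2))   # 9 9
print(float(alpha_a), float(alpha_b))       # about 7.38 and 7.07
```

In both cases the population-level ratio falls below the common stratum value of 9, and it moves when only the distribution of X changes.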



Our conclusion is that in a case-control study the simple 2-way
table (Table 4) usually holds no causal interest. The only hope is to
stratify on covariates and to estimate the $\alpha^*_x$. If the stratified table
exhibits constant cross-product ratios then the strongest form of causal
inference appears to be to estimate $\alpha_0$ and assume that it equals $\alpha_x$.
These latter parameters give the amount that the second level of the
causal agent increases the proportion of units in each X-stratum that
are "cases" relative to the first level of the causal agent. This
"amount of increase" is in terms of the odds corresponding to the proportions;
thus, for example, for a given value of the proportion $P(Y_1 = 1 \mid X = x)$ we
calculate $P(Y_2 = 1 \mid X = x)$ via the formula

$$P(Y_2 = 1 \mid X = x) = \frac{\alpha_0\, P(Y_1 = 1 \mid X = x)}{P(Y_1 = 0 \mid X = x) + \alpha_0\, P(Y_1 = 1 \mid X = x)}. \qquad (38)$$

Comparing this to the given value of $P(Y_1 = 1 \mid X = x)$ leads to a causal
inference about the effect of the causal agent when $X = x$.
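The odds-scale calculation described above is short enough to sketch directly. The function name `exposed_probability` is an assumption for this example; the inputs used below are the stratum values from Table 8 with the constant ratio 9.

```python
def exposed_probability(p_unexposed, alpha0):
    """Return P(Y_2=1 | X=x) implied by multiplying the odds of
    p_unexposed = P(Y_1=1 | X=x) by the cross-product ratio alpha0."""
    if not 0.0 < p_unexposed < 1.0:
        raise ValueError("p_unexposed must lie strictly between 0 and 1")
    odds_exposed = alpha0 * p_unexposed / (1.0 - p_unexposed)
    return odds_exposed / (1.0 + odds_exposed)

# Table 8 strata: 1/10 maps to 5/10 and 1/100 maps to 1/12 when alpha0 = 9.
p_x1 = exposed_probability(0.10, 9.0)
p_x2 = exposed_probability(0.01, 9.0)
```

The same ratio of 9 corresponds to an increase of 0.40 in one stratum and of roughly 0.07 in the other, which is why the comparison has to be made against the given baseline proportion.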



Prospective vs. retrospective matching

Another way to see the fundamental weakness in retrospective studies
is to compare prospective matching, which matches an exposed and an unexposed unit
with respect to X, and retrospective matching, which matches a case and a
control with respect to X. Prospective matching reconstructs the randomized
block experiment with blocks defined by X, so that at each level of X an
exposed unit is compared with an unexposed unit in the matched pairs. Thus
prospective matching on X perfectly controls for X whenever both members
of each matched pair have the same values of X.

In contrast, retrospective matching on X in general cannot perfectly
control for X because it does not reconstruct the randomized block experi-
ment. In each matched pair, one member is a case and one member is a
control; to reconstruct the randomized block experiment, one member must
be exposed and one unexposed, which generally does not occur when one
member is a case and the other a control. Thus summaries from the case-
control matched pairs do not represent
estimates for which X has been controlled, even when all matched pairs are
exactly matched with respect to X. With retrospective matches, we really
need to estimate the cross-product ratio in each matched pair, and this
requires building a model relating $Y_S$ to X and S. We illustrate this
in the next section.



4. An Example

The following data are taken from a case-control study of the
relationship of coffee drinking and occurrences of myocardial infarctions
(MI) by Jick et al. (1973). We use these data for illustrative purposes
only. A total of 24,741 patients were classified as "cases" (had an MI)
or "controls" (did not have an MI). Table 9 shows the standard 2-way
table that presents the cases and controls cross-classified by the
reported daily coffee consumption. The earlier sections considered only
two levels of the causal agent; with four levels of coffee consumption,
each cross-product ratio is computed relative to the first level, $\ell = 1$.

Table 9 about here



Table 9 suggests a modest increase in the risk of MI among persons
who drink coffee. The odds ratios range from 1.5 to 1.8. The cross-product
ratios exhibited in Table 9 are not monotone in the amount of self-reported
coffee drinking and the effect seems to be almost as strong for persons
who drink 1-2 cups per day as for those who drink 6+ cups per day.
However, Table 9 does not take any other background variables into
account and, as we have discussed earlier, therefore is likely to be
misleading because (1) it is not reasonable to believe that the drinking of
coffee is independent of other relevant factors, and (2) the
matching of cases and controls on background variables does not control
for them. In addition to the variable S = level of self-reported coffee
intake and $Y_S$, the following set of variables were also
available on the patients in the study:

A: Age, in levels 20-29, 30-39, ..., 70-79.
Il^l^f^ ^ - • i4;;^'4G«idit-:-.fiHl^e^^ -^^ --^ 






Table 9. Cross-tabulation of self-reported coffee intake (S) by
cases and controls (Y) for 24,741 patients

S = self-reported coffee consumption per day

                          S=1            S=2          S=3          S=4
                          0 cups/day     1-2 cups     3-5 cups     6+ cups     Total
Y = 1, MI cases           128            269          147          86          630
Y = 0, non-MI controls    6918           9371         5290         2532        24111
Total                     7046           9640         5437         2618        24741

Estimated raw cross-product ratios α*(ℓ), relative to ℓ=1:

α*(2) = 1.551     α*(3) = 1.502     α*(4) = 1.836
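The raw cross-product ratios reported under Table 9 follow directly from the counts. A minimal sketch, with variable and function names chosen for illustration:

```python
# Counts from Table 9, ordered by exposure level S = 1, 2, 3, 4.
cases = [128, 269, 147, 86]
controls = [6918, 9371, 5290, 2532]

def raw_cross_product_ratios(cases, controls):
    """Case/control odds at each level divided by the odds at the first level."""
    base_odds = cases[0] / controls[0]
    return [(c / n) / base_odds for c, n in zip(cases[1:], controls[1:])]

ratios = raw_cross_product_ratios(cases, controls)
print([round(r, 3) for r in ratios])   # [1.551, 1.502, 1.836]
```

These are the unadjusted values $\alpha^*(\ell)$; they ignore every covariate, which is the limitation the rest of the section addresses.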





40 



a sparse table indeed! Many approaches to simplifying this sort of
situation are possible. We shall use log-linear contingency table
models (a) because of their direct relationship to the cross-product
ratios, (b) because they allow us to see the effect of all of the
covariates simultaneously and (c) because they do not force us to
rely heavily on the sparse 7-dimensional table.



Loglinear Models for this Problem

Let X = (A,G,C,O,H) denote our complete vector of covariates. The
retrospective probabilities from (25) may be expressed as:

$$\log r_{k\ell x} = u + u_{1(k)} + u_{2(\ell)} + u_{3(x)} + u_{12(k,\ell)} + u_{13(k,x)} + u_{23(\ell,x)} + u_{123(k,\ell,x)} \qquad (39)$$

where the u-terms in (39) are assumed to satisfy the usual ANOVA-like
identifying constraints $u_{1(+)} = u_{2(+)} = \cdots = 0$. If we express the
cross-product ratios

$$\alpha^*_x(\ell) = \frac{r_{1\ell x} / r_{0\ell x}}{r_{11x} / r_{01x}} \qquad (40)$$

in terms of the u-terms in (39) it is easy to show that the following
equation holds:

$$\log \alpha^*_x(\ell) = u_{12(1,\ell)} - u_{12(1,1)} - u_{12(0,\ell)} + u_{12(0,1)} + u_{123(1,\ell,x)} - u_{123(1,1,x)} - u_{123(0,\ell,x)} + u_{123(0,1,x)}. \qquad (41)$$

From (41) it follows that the hypothesis that $\alpha^*_x(\ell)$ does not depend on x
corresponds to the log-linear model (see, for
example, Bishop, Fienberg and Holland, 1975) specified by setting $u_{123} = 0$
for all x. Thus, we may investigate the question of whether or not the
cross-product ratios $\alpha^*_x(\ell)$ depend on x by testing three-way interactions
of the various covariates in X with $Y_S$ and S. Furthermore, if a
model where $u_{123} = 0$ is acceptable, the estimated $u_{12}$-terms may be used
to obtain estimates of $\alpha^*_x(\ell)$. If we are willing to make the assumptions
necessary to insure that $\alpha^*_x(\ell) = \alpha_x(\ell)$, where $\alpha_x(\ell)$ is the causally
relevant parameter discussed in Section 3, then we may test $\alpha_x(\ell) = 1$
(i.e., no effect of different levels of the causal agent) by testing
that $u_{12} = 0$. This test will adjust for the distribution of the covariates
in the several exposure groups.
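The relationship between the cross-product ratios and the u-terms can be verified numerically. The sketch below draws arbitrary u-terms, builds $\log r_{k\ell x}$ from the saturated expansion, and compares the log cross-product ratio computed directly with the combination of $u_{12}$ and $u_{123}$ terms. All names and array sizes are assumptions for illustration. The identifying constraints are not imposed here because the algebraic identity does not depend on them, and the resulting $r$ values are not normalized to be probabilities. Index 0 in the code stands for the first exposure level.

```python
import random

K, L, X = 2, 4, 3            # response values, exposure levels, covariate cells
rng = random.Random(0)

def draw(*shape):
    """Nested lists of standard normal draws with the given shape."""
    if len(shape) == 1:
        return [rng.gauss(0.0, 1.0) for _ in range(shape[0])]
    return [draw(*shape[1:]) for _ in range(shape[0])]

u = rng.gauss(0.0, 1.0)
u1, u2, u3 = draw(K), draw(L), draw(X)
u12, u13, u23 = draw(K, L), draw(K, X), draw(L, X)
u123 = draw(K, L, X)

def log_r(k, l, x):
    """Saturated log-linear expansion of log r_{klx}."""
    return (u + u1[k] + u2[l] + u3[x]
            + u12[k][l] + u13[k][x] + u23[l][x] + u123[k][l][x])

def log_alpha_direct(l, x):
    """Log cross-product ratio of level l against the first level, in cell x."""
    return (log_r(1, l, x) - log_r(0, l, x)) - (log_r(1, 0, x) - log_r(0, 0, x))

def log_alpha_from_u(l, x):
    """The same quantity written with two-way and three-way u-terms only."""
    return (u12[1][l] - u12[1][0] - u12[0][l] + u12[0][0]
            + u123[1][l][x] - u123[1][0][x] - u123[0][l][x] + u123[0][0][x])

max_gap = max(abs(log_alpha_direct(l, x) - log_alpha_from_u(l, x))
              for l in range(1, L) for x in range(X))
```

Setting every `u123` entry to zero makes `log_alpha_from_u` independent of `x`, which is the constant cross-product ratio model.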



Simplifying the analysis

As described above it may seem as though we are considering the whole
2x4x1728 table, but one important feature of the use of log-linear models
is that they do not force this unless there is sufficient data to do so.
Instead we break up X = (A,G,C,O,H) into various marginal distributions and
expand the model in (39) to make use of them. In the present example we
expanded the table to the full seven dimensions, but only fit effects for
the following pairs and triples of variables:



(u^^) HS/AS/GCS/GOS/COS/ \ ^ 



The u-terms in parentheses indicate which terms in (39) have been expanded.





... _ ^ . . ,^ • " ^ .\ 



Results 



If we fit the log-linear model indicated by the pairs and triples of
variables in (44) and then delete the SY terms and refit the model, we
obtain a likelihood ratio test of $\alpha_0(\ell) = 1$. The value of the likelihood
ratio statistic is 12.3, which under the null hypothesis has 3 degrees
of freedom. Thus, this analysis results in a significant relationship
between coffee consumption and myocardial infarctions. The estimated
$\alpha_0(\ell)$ values are

$$\alpha_0(2) = 1.188, \qquad \alpha_0(3) = 1.235, \qquad \alpha_0(4) = 1.719 \qquad (45)$$

as opposed to the raw cross-product ratios given in Table 9. These
adjusted cross-product ratios are monotonic in the amount of coffee
consumed and the major effect is seen to be for high levels of coffee
consumption.
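To see how strong the evidence from a likelihood ratio statistic of 12.3 on 3 degrees of freedom is, its upper-tail probability under the chi-square reference distribution can be computed in closed form. The sketch below uses the standard expression for 3 degrees of freedom; the function name is an assumption for this example, and the paper itself reports only the statistic and its degrees of freedom.

```python
import math

def chi2_sf_3df(x):
    """Upper-tail probability P(chi-square with 3 df > x), for x >= 0."""
    if x < 0:
        raise ValueError("x must be non-negative")
    half = x / 2.0
    return math.erfc(math.sqrt(half)) + math.sqrt(2.0 * x / math.pi) * math.exp(-half)

p_value = chi2_sf_3df(12.3)
print(round(p_value, 4))
```

The result is well below 0.01, in line with the description of the relationship as significant.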



To study the question of whether $\alpha^*_x(\ell)$ varies with x we fit 5
additional models each of which replaces SY in (44) by one of these
triples of variables: HSY, ASY, GSY, CSY, or OSY. The likelihood ratio
statistics for these models, the degrees of freedom and attained
significance levels are given in Table 10.

Table 10 about here



Isl*- iienificant, jlTai^' result \ 




Table 10. Summary of study of dependence of $\alpha_x(\ell)$ on x



^V^. 



' Interaction with 


df 




^ Level attained , .« • 


A » 

/o • : 


• .'69 

' 

3 

. . 6 

-3 


"79.39 
10.31 

» \ 

3 •97 


.20 . , 

■ ^ - .is - ' '.. . 

• .25 • " 



■ Pi 






REFERENCES 




Basu, D., "Randomization Analysis of Experimental Data: The Fisher
Randomization Test", Journal of the American Statistical Association, 1980.

Bishop, Y.M.M., Fienberg, S.E., and Holland, P.W., Discrete Multivariate
Analysis: Theory and Practice, The MIT Press, Cambridge, Mass., 1975.

Cochran, W.G., "The Planning of Observational Studies of Human Populations,"
Journal of the Royal Statistical Society, Series A, 128, Part 2,
234-255, discussion 255-265, 1965.

Cornfield, J., "A method of estimating comparative rates from clinical
data, application to cancer of the lung, breast and cervix",
Journal of the National Cancer Institute, 11, 1269-1275, 1951.

Cornfield, J., "A statistical problem arising from retrospective studies",
Proceedings of the Third Berkeley Symposium, 4, 135-148, 1956.

Cox, D.R., Planning of Experiments, John Wiley & Sons, Inc., New York, 1958.

Fisher, R.A., The Design of Experiments, Oliver & Boyd, Edinburgh, 1935.

Granger, C.W.J., "Investigating Causal Relations by Econometric Models and
Cross-Spectral Methods," Econometrica, 37, 424-438, 1969.

Jick, H. et al., "Coffee and Myocardial Infarction", New England Journal
of Medicine, 289, 63-67, 1973.

Kempthorne, O., The Design and Analysis of Experiments, John Wiley & Sons,
Inc., New York, 1952.

Kempthorne, O., Discussion of "On rereading R. A. Fisher" by Leonard J.
Savage, The Annals of Statistics, 4, 495-497, 1976.

ai, WV;. VT^he-Slignlf fcance of^^^ JduroaL^of the Merjcan ' * -Vlll 

S t^tlahtical: AssocEtlbn. .- 1980 . / -.. • • y , -^-^ ..f-^ 

^^S'--'- ./ ;|^?^f ^S^y -§5?^^ .... 'V : 'iji 

• • ,. ....^ 

■ q^^"JlandomJL26Xion Anal^^ o'f .Exper4mgntal. Data: , - . . .4'!^P 

'a. 





